[d3d11] Use host memory in deferred context for "small" updates. #1805
Use host memory instead of spinning like crazy in the slice allocator and thrashing the CPU caches while fighting other deferred context threads. Using the heap here gets us ~15 more fps in Dark Souls III.
What kind of hardware is this a problem on? I've never seen Dark Souls 3 perform poorly in comparison to Windows, even on my 10-year-old Phenom II X6 back in the day. I really don't like the idea of introducing memory allocations and memcpys on a hot path that we're going to hit every single time in the vast majority of games using Deferred Contexts. In fact, doing that is what made some other games (like Diablo III, iirc) run very slowly, which is why the change was made in the first place. This looks like it does exactly the same thing as the code path in case
I'm running on a laptop with an Intel i7-8565U CPU (and an NVIDIA 1070 eGPU, but it's not the bottleneck here), and perf reports 30% CPU usage on the related atomic ops. I haven't tried moving everything through the other code path. I thought AllocUpdateBufferSlice was also allocating a slice, but after a closer look it seems it could fit. However, I initially didn't limit the size of the updates, and Sekiro, for instance, then performed very badly. I think the issue is only with "small" updates, which are apparently used a lot in DS3.
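The size cutoff mentioned above could be sketched roughly as follows. This is only an illustration: `kSmallUpdateLimit`, `UpdatePath`, `RecordedUpdate` and `recordUpdate` are hypothetical names, the 256-byte limit is an assumed value (the thread does not state the one used), and none of this is DXVK's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cutoff for a "small" update; the real value is not given in the thread.
constexpr std::size_t kSmallUpdateLimit = 256;

// Which path a recorded update would take.
enum class UpdatePath { HostCopy, BufferSlice };

// Hypothetical record of one update on a deferred context.
struct RecordedUpdate {
  UpdatePath path;
  std::vector<std::uint8_t> hostData; // only filled for HostCopy
};

// Small updates go through host memory, everything else through a buffer slice.
UpdatePath chooseUpdatePath(std::size_t size) {
  return size <= kSmallUpdateLimit ? UpdatePath::HostCopy
                                   : UpdatePath::BufferSlice;
}

// Copies small payloads to the heap; larger ones would be written into a
// slice obtained from the allocator, which this sketch leaves out.
RecordedUpdate recordUpdate(const void* data, std::size_t size) {
  RecordedUpdate update;
  update.path = chooseUpdatePath(size);
  if (update.path == UpdatePath::HostCopy) {
    const auto* bytes = static_cast<const std::uint8_t*>(data);
    update.hostData.assign(bytes, bytes + size);
  }
  return update;
}
```

Capping the size keeps large uploads away from the extra allocation and copy, which matches the observation that the unbounded version hurt Sekiro.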
The problem in general is that these small updates also tend to be made very frequently if a game doesn't make use of I'm also not sure how much of a problem lock contention really is for most games; it seems like DS3 just uses one single dynamic constant buffer for just about everything, but that's not the norm either.
I think the issue is not so much lock contention but rather cache thrashing. For instance, using the heap makes the threads fight each other as well, but it ends up being nicer to the cache. Using a std::mutex instead of a spinlock for the slice allocator also helps a bit, but then there's still some thrashing visible on the buffer refcount.
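To make the spinlock versus std::mutex comparison concrete, here is a minimal sketch of a free list whose lock type is a template parameter. `SpinLock` and `SliceFreeList` are hypothetical stand-ins written for this illustration; they are not DXVK's spinlock or slice allocator.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical spinlock. Every failed exchange still writes to the cache
// line holding the flag, which is one way waiting threads thrash caches.
class SpinLock {
public:
  void lock() {
    while (m_locked.exchange(true, std::memory_order_acquire))
      std::this_thread::yield();
  }
  void unlock() { m_locked.store(false, std::memory_order_release); }
private:
  std::atomic<bool> m_locked{false};
};

// Hypothetical free list of slice indices, parameterized on the lock type
// so that SpinLock and std::mutex can be swapped without other changes.
template<typename Lock>
class SliceFreeList {
public:
  explicit SliceFreeList(std::size_t count) {
    for (std::size_t i = 0; i < count; i++)
      m_free.push_back(i);
  }

  // Returns false when no slice is available.
  bool alloc(std::size_t& slice) {
    std::lock_guard<Lock> guard(m_lock);
    if (m_free.empty())
      return false;
    slice = m_free.back();
    m_free.pop_back();
    return true;
  }

  // Returns a slice to the list.
  void free(std::size_t slice) {
    std::lock_guard<Lock> guard(m_lock);
    m_free.push_back(slice);
  }

  // Number of slices currently free.
  std::size_t available() {
    std::lock_guard<Lock> guard(m_lock);
    return m_free.size();
  }

private:
  Lock m_lock;
  std::vector<std::size_t> m_free;
};
```

`SliceFreeList<SpinLock>` and `SliceFreeList<std::mutex>` expose the same interface, so both variants can be profiled under the same multi-threaded workload. A mutex lets waiting threads sleep instead of hammering the flag, but it does nothing about contended atomics elsewhere, such as a shared refcount.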
I tried the option, and disabling single use mode for this game improves performance in the same way on my hardware. Would it make sense to add it to the default config?
We can do that, but I'd like to get some numbers from different hardware configs. This game is unfortunately very annoying to benchmark because it's locked to 60 FPS. What kind of frame rates are you getting before and after enabling it?
Right at the start of the game, with all settings on low at 1280x720 and without moving the camera, it was hardly above 45 fps before, with ~25% GPU usage (reported on the HUD); enabling it makes it reach 60 fps with ~40% GPU usage. It's not obviously capped though, and moving the camera changes the numbers too; in particular, it's better in both cases with less geometry on screen. Although it's much better with the single use mode disabled, perf still shows 11% CPU usage in Note that using the heap like I did here didn't change the perf report much, although the 10% CPU time is then spent waiting on the global heap cs (or thread local heap atomic ops if I use it instead).
Hi Rémi, the modification you made improved fps stability in Dark Souls Remastered for me; I now get a constant 60 fps. Here is before and after in Dark Souls Remastered in Full HD with 720p resolution scale and only motion blur and FXAA enabled. Firelink Shrine was one of the areas where the fps varied the most for me. Thank you.
Nice to know. As discussed above there's already a
@ViNi-Arco what's your hardware configuration? I wonder if (besides contesting the one constant buffer) this has anything to do with latencies involved when writing data to external GPUs directly. DXVK is known to not do particularly well in such configurations.
Hello Mr. Philip, I have an Intel Q9650 with DDR3 @ 1333 with tight timings and a 1T command rate.
Edit-2: This higher CPU usage probably was a Edit: I tested it on Far Cry 4 and there was no difference, so it must be specific to some games, maybe. I only have these two DX11 games to test at the moment.
I'm closing this, as it seems it's hardly useful. People using it as a patch probably shouldn't.
I'm not sure of the lifetime of things involved here, so I'm opening the PR to get some feedback. Dark Souls III performance is clearly hitting a bottleneck here, with multiple threads fighting for slices and thrashing the CPU cache in several places: the free slice spinlock, but also the atomic operations on the buffer refcount.
I tried a few games with it and it seems to be working alright, but I didn't really test it extensively.